Red Wine Quality by AYSUN AKARSU

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Structure of the loaded red wine quality dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ ratings             : Factor w/ 3 levels "low","average",..: 2 2 2 2 2 2 2 3 3 2 ...

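Structure of the dataset after preprocessing: the column X, which only holds the row index, is removed, and quality is converted to an ordered factor.

A new variable, ratings, is added to the dataset.

The dataset has 1599 observations. After X is removed there are 12 variables (11 chemical properties plus quality); adding ratings brings the total to 13.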
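
The preprocessing described above can be sketched in base R. This is a minimal illustration, not necessarily the code used for this report: the data frame name `wine` is hypothetical, and only the first 10 rows of three columns are reproduced from the `str()` output so that the block runs on its own.

```r
# First 10 rows of a few columns, copied from the str() output above.
# The name `wine` is an assumption made for this sketch.
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the index column X, which carries no information about the wine.
wine$X <- NULL

# Convert quality to an ordered factor with the six levels seen in the data.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine)
```

Declaring all six levels explicitly keeps the factor consistent even when a subset of the data does not contain every score.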

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18  
##     ratings    
##  low    :  63  
##  average:1319  
##  high   : 217  
##                
##                
## 

The summary statistics of the dataset are shown above.

Quality

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The majority of the wines have a quality score of 5 or 6. Very few have a low score of 3 or 4, and a modest number have a high score of 7 or 8.

Most wines have an average rating. Very few have a low rating. High-rated wines outnumber low-rated wines but are far fewer than average-rated wines.

Univariate Plots Section

The distribution of the values of each chemical property of the red wines:

The boxplots of the chemical properties show that residual.sugar, chlorides, sulphates, and total.sulfur.dioxide have many outliers. fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, and pH have few outliers.
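
The outlier counts read off the boxplots can also be obtained numerically. The sketch below assumes the default boxplot rule in R (points more than 1.5 times the hinge spread beyond the hinges) and uses only the first 10 values of two columns from the `str()` output; the helper name `count_outliers` is hypothetical.

```r
# First 10 values of two columns, copied from the str() output.
sample_wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Count the points a default boxplot would draw as outliers
# (beyond 1.5 times the hinge spread from the lower or upper hinge).
count_outliers <- function(x) length(boxplot.stats(x)$out)

outlier_counts <- sapply(sample_wine, count_outliers)
outlier_counts
```

Applied to every numeric column of the full dataset, the same call would give one outlier count per chemical property.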

Univariate Analysis

What is the structure of your dataset?

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine.

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Observation: most wines have medium quality (scores of 5 and 6).

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature. We want to find out which factors determine quality. The dataset documentation already explains how some chemical properties affect quality. For example, the authors of the dataset note that high volatile acidity can lead to an unpleasant, vinegar-like taste. However, as we have done only univariate analysis so far, we don’t yet know exactly how these chemical properties are related to the quality of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

As we have done only univariate analysis we don’t know exactly how these chemical properties are related to the quality of wine.

Did you create any new variables from existing variables in the dataset?

A ratings variable was created based on quality. Wines with quality below 5 are rated low, wines with quality 5 and 6 are rated average, and wines with quality above 6 are rated high.
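
One way to derive such a variable is with `cut()`. This is a sketch under assumptions: the report does not show its code, so the break points below are simply a translation of the rule stated above, and the quality vector is rebuilt from the frequency table printed earlier rather than read from the data file.

```r
# Rebuild the quality scores from the counts in the frequency table above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf).
ratings <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "average", "high")
)

table(ratings)
```

The resulting counts can be compared with the ratings column of the summary output (63 low, 1319 average, 217 high).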

Of the features you investigated, were there any unusual distributions?

Yes, some distributions are unusual, such as those of chlorides and residual.sugar. They are highly right skewed.

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

No, apart from removing the index column X and adding the ratings variable, I haven’t done any operations on the data.
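
The right skew of residual.sugar is visible in the summary statistics (mean 2.539 above median 2.200) and can be quantified with a moment-based skewness coefficient. The sketch below is illustrative only: the function name `skewness` is hypothetical, the input is the first 10 residual.sugar values from the `str()` output, and the log10 transform is an assumption added for comparison, not a step performed in this report.

```r
# Moment-based skewness: third central moment over variance^(3/2).
# Positive values indicate a longer right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# First 10 residual.sugar values, copied from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))  # optional transform, for comparison

c(raw = skew_raw, log10 = skew_log)
```

On this small sample the log transform lowers the skewness but does not remove it; the full dataset may behave differently.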

Bivariate Plots Section

Alcohol, sulphates, citric.acid, fixed.acidity, and volatile.acidity have high correlations with the wine quality ratings. pH, density, and chlorides show some correlation with the wine quality ratings.

Residual.sugar, chlorides, and free.sulfur.dioxide do not have significant importance in distinguishing low-quality from high-quality wines.

Alcohol is positively correlated with pH and negatively correlated with density. Wines with high volatile.acidity tend to be of low quality. Density is positively correlated with fixed.acidity. Fixed.acidity is positively correlated with citric.acid and negatively correlated with pH.
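
Pairwise correlations like these can be tabulated with `cor()`. The sketch below assumes Pearson correlation, which may not be the method behind the plots in this report, and uses only the first 10 rows from the `str()` output, so the values it prints illustrate the method rather than reproduce the reported results. The data frame name `wine` is hypothetical.

```r
# First 10 rows of selected columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)  # kept numeric for cor()
)

# Pearson correlation matrix, rounded for readability.
cors <- round(cor(wine, method = "pearson"), 2)
cors
```

Even on this small sample, fixed.acidity correlates positively with citric.acid and negatively with pH, in line with the relationships described above.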

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Alcohol, pH, sulphates, density, fixed.acidity, volatile.acidity, and citric.acid are the main factors that distinguish low-quality from high-quality wines.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Alcohol has a positive correlation with pH and a negative correlation with density and total.sulfur.dioxide.

pH has a negative correlation with volatile.acidity.

Density is positively correlated with fixed.acidity.

Fixed.acidity is positively correlated with citric.acid.

What was the strongest relationship you found?

The strongest relationships are between wine quality and alcohol, and between wine quality and the acidity variables (fixed.acidity, volatile.acidity, citric.acid).
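
A simple numeric companion to a quality-versus-alcohol plot is the median alcohol content per quality score. This is a minimal sketch using the first 10 rows from the `str()` output; ten rows are far too few to show the trend reported here, so the block demonstrates the computation only. The variable names are hypothetical.

```r
# First 10 alcohol and quality values, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)

# Median alcohol for each quality score present in the sample;
# droplevels() removes scores that do not occur in these 10 rows.
median_alcohol <- tapply(alcohol, droplevels(quality), median)
median_alcohol
```

Run on all 1599 wines, the same call would return one median per score from 3 to 8, which can be compared directly with the boxplots.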

Multivariate Plots Section

Alcohol and density are negatively correlated. Wine quality increases as alcohol increases, and high-quality wines tend to have low density. High volatile acidity helps us identify low-quality wines.

High-quality wines have higher sulphates. Here again we see that wines with high volatile acidity tend to be of low quality.

Density and fixed.acidity are positively correlated. Here again we see the association between high volatile acidity and low-quality wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The multivariate analysis makes us more confident that alcohol is an important chemical property in defining the quality of red wine. The dataset documentation states that high volatile.acidity is associated with low-quality wines. The graphs in this section show that as volatile.acidity increases, the quality of the red wine decreases. Also, high-quality wines have higher sulphates and lower density.

Were there any interesting or surprising interactions between features?

Acidity is important in wine quality. High citric.acid and fixed.acidity tend to produce better quality wines as long as volatile.acidity is not high.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Most of the red wines in the dataset have average quality. There are very few low-quality wines (63) and somewhat more high-quality wines (217).

Plot Two

Description Two

Alcohol, pH, sulphates, density, fixed.acidity, volatile.acidity, and citric.acid are the main factors that distinguish low-quality from high-quality wines.

Plot Three

Description Three

Alcohol and volatile.acidity are important chemical properties in defining the quality of red wine. Quality increases as alcohol increases; in contrast, quality decreases as volatile.acidity increases. Low density is another factor associated with higher quality.


Reflection

The red wine dataset contains 1599 observations with 11 features on the chemical properties. The main feature is the wine quality. We are interested in the chemical property features which determine wine quality. Below are our findings.

1 - Fixed acidity has a positive correlation with wine quality, unlike volatile acidity.

2 - Volatile acidity is important in determining wine quality and is negatively correlated with it. In our data analysis, we found that low-quality wines have high volatile acidity.

3 - Citric acid is positively correlated with wine quality, unlike volatile acidity. Our data analysis shows that wine quality increases as citric acid increases.

4 - Residual sugar is not effective in determining the wine quality.

5 - Chlorides is not effective in determining the wine quality.

6 - Free sulfur dioxide doesn’t have significant effect on wine quality.

7 - Total sulfur dioxide doesn’t have significant effect on wine quality.

8 - Density determines the wine quality. The data suggest that good quality wines have low density.

9 - pH is related to wine quality. The data suggest that good-quality wines have low pH.

10 - Sulphates are effective in determining wine quality. Wines with higher sulphates have higher quality.

11 - Alcohol is the most important factor in determining wine quality. The data strongly suggest that the higher the alcohol content, the more likely the wine is to be of better quality.

The red wine quality dataset is highly unbalanced. Most of the wines have average quality, and there are very few low-quality wines. More data on low- and high-quality wines could improve the quality of the analysis. Some chemical properties that this analysis found to have no effect on wine quality might give different results with more data.

Resources

https://www.r-bloggers.com/quick-plot-of-all-variables/

http://ggobi.github.io/ggally/#ggally